Effective Early Termination Techniques for Text Similarity Join Operator
Abstract
The text similarity join operator joins two relations if their join attributes are textually similar to each other. It has a variety of application domains, including the integration and querying of data from heterogeneous resources, data cleansing, and data mining. Although the text similarity join operator is widely used, its processing is expensive due to the huge number of similarity computations performed. In this paper, we incorporate some shortcut evaluation techniques from the Information Retrieval domain, namely the Harman, quit, continue, and maximal similarity filter heuristics, into the previously proposed text similarity join algorithms to reduce the number of similarity computations needed during the join operation. We experimentally evaluate the original and the heuristic-based similarity join algorithms using real data obtained from the DBLP Bibliography database, and observe performance improvements with the continue and maximal similarity filter heuristics.
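As a rough illustration of how an upper-bound filter can skip similarity computations in a text similarity join, the sketch below joins two lists of strings on cosine similarity. It is not the paper's algorithm: the abstract does not define the maximal similarity filter, so the bound used here (each term weight multiplied by the largest weight of that term in the other relation) is an assumption, as are the names `unit_vector` and `similarity_join`, the whitespace tokenization, and the use of plain term frequencies as weights.

```python
import math
from collections import Counter, defaultdict


def unit_vector(text):
    """Lowercased term-frequency vector scaled to unit L2 norm (assumed weighting)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def similarity_join(left, right, threshold):
    """Return (i, j, sim) for pairs whose cosine similarity is >= threshold."""
    left_vecs = [unit_vector(t) for t in left]
    right_vecs = [unit_vector(t) for t in right]

    # Inverted index over `right`, plus the largest weight seen for each term.
    index = defaultdict(list)
    max_weight = defaultdict(float)
    for j, vec in enumerate(right_vecs):
        for term, w in vec.items():
            index[term].append((j, w))
            max_weight[term] = max(max_weight[term], w)

    results = []
    for i, vec in enumerate(left_vecs):
        # Upper bound on the similarity of this tuple to ANY tuple in `right`.
        bound = sum(w * max_weight.get(term, 0.0) for term, w in vec.items())
        if bound < threshold:
            continue  # early termination: no right tuple can reach the threshold
        # Accumulate dot products only for tuples sharing at least one term.
        scores = defaultdict(float)
        for term, w in vec.items():
            for j, w_right in index.get(term, ()):
                scores[j] += w * w_right
        results.extend((i, j, s) for j, s in sorted(scores.items()) if s >= threshold)
    return results


if __name__ == "__main__":
    titles_a = ["text similarity join", "query optimization"]
    titles_b = ["similarity join operator", "graph databases"]
    for i, j, sim in similarity_join(titles_a, titles_b, threshold=0.5):
        print(titles_a[i], "|", titles_b[j], "|", round(sim, 3))
```

Because all weights are non-negative, the bound can never be smaller than the true similarity, so the filter only skips tuples that could not have produced a result.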
Similar resources
Exploiting Database Similarity Joins for Metric Spaces
Similarity Joins are recognized among the most useful data processing and analysis operations and are extensively used in multiple application domains. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. Multiple Similarity Join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and exter...
Similarity Joins: Their implementation and interactions with other database operators
Similarity Joins are extensively used in multiple application domains and are recognized among the most useful data processing and analysis operations. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. While several standalone implementations have been proposed, very little work has addressed the implementation of Similarity Joins as physical database opera...
A Wider Concept for Similarity Joins
Join is one of the most studied and employed retrieval operators made available by the modern relational database management systems (RDBMSs). This binary operator is algebraically defined as a Cartesian product followed by the selection operator that specifies the join condition. In modern RDBMS, the join condition employs comparison operators based both on equality and on the Total Ordering R...
Set Similarity Join on Probabilistic Data
Set similarity join has played an important role in many real-world applications such as data cleaning, near duplication detection, data integration, and so on. In these applications, set data often contain noises and are thus uncertain and imprecise. In this paper, we model such probabilistic set data on two uncertainty levels, that is, set and element levels. Based on them, w...
Database Similarity Join for Metric Spaces
Similarity Joins are recognized among the most useful data processing and analysis operations. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. While several standalone implementations have been proposed, very little work has addressed the implementation of Similarity Join as a physical database operator. In this paper, we focus on the study, design and im...
Publication date: 2005